mmap, the language go, problems with the linux kernel

martin capitanio

Feb 8, 2011, 7:38:22 AM

to linux-...@vger.kernel.org, torv...@linux-foundation.org

A serious problem has come up while implementing fast memory
management for the Go language. Maybe some experienced kernel hackers
could join the discussion and help find the best Linux solution for
the "mmap fiasco" problem.

https://groups.google.com/forum/#!msg/golang-dev/EpUlHQXWykg/LN2o9fV6R3wJ

Thanks (on behalf of all Linux Go users :),
Martin

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Alan Cox

Feb 8, 2011, 8:23:45 AM

to martin capitanio, linux-...@vger.kernel.org, torv...@linux-foundation.org

On Tue, 08 Feb 2011 13:37:58 +0100
martin capitanio <m...@capitanio.org> wrote:

> A serious problem has come up while implementing fast memory
> management for the Go language. Maybe some experienced kernel hackers
> could join the discussion and help find the best Linux solution for
> the "mmap fiasco" problem.
>
> https://groups.google.com/forum/#!msg/golang-dev/EpUlHQXWykg/LN2o9fV6R3wJ
>
> Thanks (on behalf of all Linux Go users :),

I don't actually see a problem.

Linux implements virtual address space limits, and enforces them. The go
language stuff wants to allocate huge amounts of virtual space so you
need to tell the OS you want to allow it to do crazy stuff, which you can
do. But virtual address space is not free - it has to be tracked and
if the application suddenly tries to fill all of it what will happen?

You'll hit problems if the kernel is running with vm overcommit disabled
(as well configured servers do).

There are of course ways and means - you can provide your own mmap to
override the libc one for example and manage the address space yourself -
within limits by allocating addresses and doing the syscall giving an
address request.

You'll be ok I suspect on Linux on x86 but there are platforms with very
complicated aliasing rules where the OS tries very hard to map certain
things at certain addresses to avoid cache aliasing work and big slow
downs. There are good reasons why mmap works the way it does.

Alan

Linus Torvalds

Feb 8, 2011, 11:24:25 AM

to martin capitanio, linux-...@vger.kernel.org

On Tue, Feb 8, 2011 at 4:37 AM, martin capitanio <m...@capitanio.org> wrote:
>
> A serious problem has come up while implementing fast memory
> management for the Go language. Maybe some experienced kernel hackers
> could join the discussion and help find the best Linux solution for
> the "mmap fiasco" problem.
>
> https://groups.google.com/forum/#!msg/golang-dev/EpUlHQXWykg/LN2o9fV6R3wJ

So, quite realistically, we can't change how "ulimit -v" works. It has
well-defined semantics, and they very much are about the mappings, not
about how many pages people use.

There's in theory a RLIMIT_RSS for tracking actual resident pages, but
in practice it doesn't _do_ anything on Linux, because it's not
something we've even bothered to count. It's much simpler and more
unambiguous to just count "how big are the mappings" than counting
individual pages. And as far as I can remember, this is literally the
first time that somebody has cared all that deeply (not to say that
people haven't asked for RSS before, but it's not been a fundamental
part of some design decision of theirs, just a wish-list).

So in theory we could change the kernel and start counting RSS, and
make RLIMIT_RSS do something useful, but in practice that would still
mean that it would take many _years_ before a project like 'go' could
rely on it, since most people don't change the kernel very often
anyway, and even if they did it's not the kernel that actually sets up
the offending RLIMIT_AS (the kernel defaults to "infinity"), but the
distribution or users' random .bash_login files or whatever.

So even if the kernel _did_ change, you'd still have this problem in
'go', and you'd still need to do something else.

And quite frankly, I think your "use a big array" in go is a mistake.
You may think it's clever and simple, and that "hey, the OS won't
allocate pages we don't touch", but it's still a serious mistake. And
it's not a mistake because of RLIMIT_AS - that's just a secondary or
tertiary symptom of you being lazy and not doing the right thing.

Think about things like mlockall() (ever imagine mixing 'go' code with
C code that does security-sensitive stuff?).

Or think about things like the kernel trying to be really clever,
noticing that you have a 16GB allocation that is well-aligned, and
deciding to help you (since the system had tons of memory) by using
large pages for it to avoid excessive TLB overhead. Yes, people are
actually working on things like that. Suddenly the page allocation
granularity might be 2MB, not 4kB.
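
To make the granularity point concrete, here is a rough back-of-the-envelope
sketch in Python. It assumes the simplest possible model, in which touching
any byte faults in the whole page that contains it. The function name and the
access pattern are made up for illustration, and nothing here describes how
the kernel actually decides when to use large pages.

```python
def resident_bytes(touched_offsets, page_size):
    """Bytes of memory backed after touching the given byte offsets,
    assuming each touch faults in exactly one whole page (a simplification)."""
    pages = {offset // page_size for offset in touched_offsets}
    return len(pages) * page_size


KB = 1024
MB = 1024 * KB
GB = 1024 * MB

# Hypothetical sparse access pattern: one byte every 64MB across a 16GB mapping.
offsets = range(0, 16 * GB, 64 * MB)

small = resident_bytes(offsets, 4 * KB)  # 256 distinct 4kB pages
large = resident_bytes(offsets, 2 * MB)  # 256 distinct 2MB pages

print(small // KB, "kB backed with 4kB pages")
print(large // MB, "MB backed with 2MB pages")
```

Under this model the same 256 touches cost 1MB with 4kB pages and 512MB with
2MB pages, which is the kind of surprise a "pages we don't touch are free"
design would be exposed to.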

I bet there are other issues like that. On 32-bit, for example, we've
often had problems with people running out of virtual memory size,
since with shared libraries etc, there really isn't all that much free
address space. You only had a 256MB mapping on 32-bit, but quite
frankly, that's about 1/8th of the whole user address space (the 2G/2G
split tends to be normal), and you are basically requiring that there
is that much contiguous virtual address space that you can just waste.
Maybe that's true of all 'go' programs now, but I can tell you that in
the C world, people have done things like "let's link this binary
statically just so that we get maximal virtual address space size,
because we need a contiguous 1GB array for our actual _problem_".
Using some random 256MB virtual allocation just because your tracking
algorithm is lazy sounds like a _bad_ idea.

Finally, I actually think you may well often be better off keeping
your data denser (by using the indirection), and then having a smaller
virtual memory (and thus TLB) lookup footprint. Of course, it sounds
like your previous indexing scheme was very close to what the page
table lookup does anyway, but many problem sets have been better off
using fancy span-based lookup in order to _avoid_ having large arrays,
and the complexity of the scheme can be very much worth it.
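
As a minimal sketch of what a span-based lookup can look like (this is not
Go's allocator, and the class and method names are hypothetical), the table
below keeps one entry per allocated span and finds the owning span by binary
search, so its size follows the number of spans rather than the size of the
address range. It assumes spans never overlap.

```python
import bisect


class SpanTable:
    """Maps addresses to span metadata without a flat per-page array."""

    def __init__(self):
        self._starts = []  # sorted span start addresses
        self._spans = []   # parallel list of (start, end, info)

    def insert(self, start, length, info):
        """Record a span covering [start, start + length)."""
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._spans.insert(i, (start, start + length, info))

    def lookup(self, addr):
        """Return the info of the span containing addr, or None."""
        i = bisect.bisect_right(self._starts, addr) - 1
        if i < 0:
            return None
        _start, end, info = self._spans[i]
        return info if addr < end else None


table = SpanTable()
table.insert(0x10000, 0x2000, "small objects")
table.insert(0x40000000, 0x100000, "large object")
print(table.lookup(0x10800))     # inside the first span
print(table.lookup(0x20000))     # in the gap between spans
```

The lookup is O(log n) in the number of spans instead of a single array
index, which is the complexity trade-off mentioned above.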

In other words, the much deeper fundamental problem of the "one big
array" approach is that you're making tons of assumptions about what
is going on, and then when one of those assumptions isn't correct
("virtual memory size doesn't matter" in this case), you end up
blaming something other than your assumptions. And I think you need to
take another look at the assumption itself.

Linus

Martin Capitanio

Feb 9, 2011, 11:30:41 AM

to Linus Torvalds, linux-...@vger.kernel.org, golang-dev, Russ Cox, Alan Cox, Albert Strasheim

On Tue, 2011-02-08 at 08:23 -0800, Linus Torvalds wrote:
> On Tue, Feb 8, 2011 at 4:37 AM, martin capitanio <m...@capitanio.org> wrote:
> >
> > A serious problem has come up while implementing fast memory
> > management for the Go language. Maybe some experienced kernel hackers
> > could join the discussion and help find the best Linux solution for
> > the "mmap fiasco" problem.
> >
> > https://groups.google.com/forum/#!msg/golang-dev/EpUlHQXWykg/LN2o9fV6R3wJ
>

..

> And quite frankly, I think your "use a big array" in go is a mistake.
> You may think it's clever and simple, and that "hey, the OS won't
> allocate pages we don't touch", but it's still a serious mistake. And
> it's not a mistake because of RLIMIT_AS - that's just a secondary or
> tertiary symptom of you being lazy and not doing the right thing.

..

So, I hope I have now managed to put everyone involved on the cc list. Here
are the relevant responses I got on the other mailing list. I think
there is still some confusion about what the mmap syscall should actually
do in the case of PROT_NONE (Data cannot be accessed)
http://pubs.opengroup.org/onlinepubs/009695399/functions/mmap.html

On Wed, 2011-02-09 at 09:57 -0500, Russ Cox wrote:
> Thanks for posting the LKML response.
> Most of what Linus says is true but probably not
> crucial enough to avoid laziness for now. We can
> always change the strategy later if it becomes a
> problem.
>
> The comment about large pages would be the most
> important reason not to do what we're doing but sounds
> more like a kernel bug than our fault. We're being
> very up front with the kernel about which memory we
> are and are not using: what we're not using has prot==0.
> If Linux sees a 16 GB prot==0 mapping and decides to
> dedicate >0 bytes of memory to backing it, then that's
> not our problem.
>
> Other tools like Native Client use enormous prot==0
> mappings. I doubt Linux would ever make the mistake
> of giving them real amounts of physical memory.

On Tue, 2011-02-08 at 13:26 +0000, Alan Cox wrote:
..

> Linux implements virtual address space limits, and enforces them. The go
> language stuff wants to allocate huge amounts of virtual space so you
> need to tell the OS you want to allow it to do crazy stuff, which you can
> do. But virtual address space is not free - it has to be tracked and
> if the application suddenly tries to fill all of it what will happen?
>
> You'll hit problems if the kernel is running with vm overcommit disabled
> (as well configured servers do).
>
> There are of course ways and means - you can provide your own mmap to
> override the libc one for example and manage the address space yourself -
> within limits by allocating addresses and doing the syscall giving an
> address request.
>
> You'll be ok I suspect on Linux on x86 but there are platforms with very
> complicated aliasing rules where the OS tries very hard to map certain
> things at certain addresses to avoid cache aliasing work and big slow
> downs. There are good reasons why mmap works the way it does.

..

On Wed, 2011-02-09 at 07:26 -0800, Albert Strasheim wrote:
> I'm a bit concerned about Alan Cox's comment:
>
> "You'll hit problems if the kernel is running with vm overcommit
> disabled (as well configured servers do)."
>
> We are planning to do exactly that, on a server that will be running
> many, many Go processes.
>
> But maybe virtual memory with prot==0 doesn't factor into the
> overcommit accounting?
..

Russ Cox

Feb 9, 2011, 11:41:03 AM

to Martin Capitanio, Linus Torvalds, linux-...@vger.kernel.org, golang-dev, Alan Cox, Albert Strasheim

I don't think there is that much more to say but thanks for
assembling the To: line.

Go is still very much an experimental project. It is fine
in our opinion to try things and see how they work.
We're happy to revisit design decisions if some of the
possible negatives that have been identified come to pass.

I agree with what Linus posted about it being of only very
long-term utility to change the kernel interface, and probably
not worth doing at all. I think it's unfortunate (at least for the
people who think ulimit -v is useful for keeping your machine
from swapping) that mmap with PROT_NONE counts against
ulimit -v, but it is what it is.

In Alan's scenario about vm_overcommit, since that is a
Linux-specific feature and presumably more malleable, I would
hope that the "commit" charge doesn't happen until you do
mmap with prot != PROT_NONE. As I said in some of the
quoted text, there are various sandboxes like Native Client
or VX32 that assume they can use mmap as a way to set up
restricted sub-address spaces at low cost, and I don't see
the benefit to committing the physical memory before the
addresses are mapped accessible.

Russ

Ted Ts'o

Feb 9, 2011, 2:18:09 PM

to Martin Capitanio, Linus Torvalds, linux-...@vger.kernel.org, golang-dev, Russ Cox, Alan Cox, Albert Strasheim

On Wed, Feb 09, 2011 at 05:30:19PM +0100, Martin Capitanio wrote:
> So, I hope I have now managed to put everyone involved on the cc list. Here
> are the relevant responses I got on the other mailing list. I think
> there is still some confusion about what the mmap syscall should actually
> do in the case of PROT_NONE (Data cannot be accessed)
> http://pubs.opengroup.org/onlinepubs/009695399/functions/mmap.html

Actually, I don't think the confusion has anything to do with
PROT_NONE. The Go designers have themselves said that their intent
was to reserve the virtual address space. So that much is clear.

The real question is what RLIMIT_AS and ulimit -v are supposed to
*do*. The Single Unix Specification (and POSIX, which is where this
comes from), is quite vague: "the maximum size of a process's total
available memory, in bytes". What in the world is "total available
memory"?!? BSD also has RLIMIT_RSS, which was not adopted by POSIX
(not surprising, given that in the early days it was dominated by
System V folks).

AIX and the BSD's don't implement RLIMIT_AS at all. Solaris does, but
the man page just says "total available memory", again without
specifying what that means. Solaris also has a RLIMIT_VMEM, which is
the total amount of virtual address space, so apparently Solaris seems
to think that RLIMIT_VMEM and RLIMIT_AS are different things.

Linux has interpreted RLIMIT_AS to mean total amount of virtual
address space for a long, long time. (The interpretation AS ==
"address space" does make sense, although it's not clear that's what
the original definition of RLIMIT_AS was supposed to mean.) Linux
also has a RLIMIT_RSS, probably taken from BSD, which is not
implemented (although if you are using memory cgroups, you can
effectively get the same result as limiting a process's RSS, although
via different API).

Bash has defined ulimit -v to mean "total amount of virtual memory"
and implements it via RLIMIT_AS, so it's pretty clear that its intent
was that ulimit -v is supposed to mean "virtual address space". (Or
maybe it was documented that way and the letter 'v' chosen because
that's what RLIMIT_AS has meant on Linux for a long time.)

The bottom line is that so long as Go's memory management system is
intending to reserve virtual address space, there is no real conflict
in the question of what PROT_NONE means. Both Linux and Go intend it
to mean, "reserve address space". The better line of argumentation
from the Go perspective is that RLIMIT_AS shouldn't mean restricting
the virtual address space, but "something else". But that would mean
changing Linux's behavior, which has been established for many, many
years. And arguably the specification is vague at best. (What does
"available memory" mean, anyway? Does it mean physical memory?
physical memory plus whatever swap space happens to be available?
Is VM overcommit taken into account --- what if every single page
in every single copy of the 'ftpd' binary gets attached by a debugger
and modified?)

Linux has interpreted it to mean "virtual address space", and in fact
it's documented as such in its version of the getrlimit man page.
I'd have to agree with Linus that it's probably way too late to change
what it means (or what Linux thinks it means, anyway).

In any case, it's deployed on so many machines that any change would
take years to roll out anyway. What I'd probably recommend to Go
developers is to check the value of RLIMIT_AS via getrlimit(), and if
it's too small for what you want, print a human-readable error or
warning message telling the user to raise the RLIMIT_AS, and then
either stop, or use some alternate allocation strategy.
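
A minimal sketch of such a check, written in Python for brevity rather than
in Go's runtime: the 16GB figure, the function name, and the message wording
are assumptions for illustration. The comparison is also simplified, since it
ignores address space the process has already mapped.

```python
import resource
import sys

ARENA_BYTES = 16 << 30  # hypothetical size of the address-space reservation


def address_space_problem(required_bytes, limit=None):
    """Return an error message if RLIMIT_AS is too small, else None.

    `limit` exists only so the logic can be exercised without changing
    the real limit; by default the soft limit is read via getrlimit().
    """
    if limit is None:
        limit, _hard = resource.getrlimit(resource.RLIMIT_AS)
    if limit == resource.RLIM_INFINITY or limit >= required_bytes:
        return None
    return (
        "RLIMIT_AS (ulimit -v) is %d bytes but %d bytes of address space "
        "are needed; raise the limit or fall back to a smaller reservation"
        % (limit, required_bytes)
    )


if __name__ == "__main__":
    problem = address_space_problem(ARENA_BYTES)
    if problem:
        print(problem, file=sys.stderr)
```

A caller would then either stop or switch to an alternate allocation
strategy when a message comes back.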

- Ted

Ian Lance Taylor

Feb 9, 2011, 2:56:50 PM

to Ted Ts'o, Martin Capitanio, Linus Torvalds, linux-...@vger.kernel.org, golang-dev, Russ Cox, Alan Cox, Albert Strasheim

"Ted Ts'o" <ty...@mit.edu> writes:

> Linux has interpreted it to mean "virtual address space", and in fact
> it's documented as such in its version of the getrlimit man page.
> I'd have to agree with Linus that it's probably way too late to change
> what it means (or what Linux thinks it means, anyway).

I don't think anybody seriously expects Linux to change the meaning of
ulimit -v at this point. Obviously Go is going to do something
different here.

However, I think it's still worth pointing out that while ulimit -v no
doubt has specialized applications, it does not do exactly what I think
most people want. I think most people want some way to say "do not let
this program cause my machine to start thrashing." That's what I use
ulimit -v for; if I don't, a program which accidentally allocates memory
in an endless loop starts thrashing. But I don't actually care how much
virtual memory the program is using; what I care about is limiting the
amount of physical memory it is using, so that it doesn't take over my
machine.

I think that would be a useful feature to implement regardless of how we
feel about ulimit -v and Go. I think we can reasonably expect more and
more programs to try to take advantage of large virtual address spaces.
Let's have a way to use them while still having a way to keep them from
thrashing.

Ian

Ted Ts'o

Feb 9, 2011, 3:11:29 PM

to Ian Lance Taylor, Martin Capitanio, Linus Torvalds, linux-...@vger.kernel.org, golang-dev, Russ Cox, Alan Cox, Albert Strasheim

On Wed, Feb 09, 2011 at 11:56:13AM -0800, Ian Lance Taylor wrote:
>
> However, I think it's still worth pointing out that while ulimit -v no
> doubt has specialized applications, it does not do exactly what I think
> most people want. I think most people want some way to say "do not let
> this program cause my machine to start thrashing." That's what I use
> ulimit -v for; if I don't, a program which accidentally allocates memory
> in an endless loop starts thrashing. But I don't actually care how much
> virtual memory the program is using; what I care about is limiting the
> amount of physical memory it is using, so that it doesn't take over my
> machine.

Agreed, I don't think ulimit -v is particularly useful for much of
anything, especially in these days of 64-bit systems where we have
gobs and gobs of address space.

If the performance hit isn't that horrible, making Linux enforce
RLIMIT_RSS is probably the right answer for the "do not let this
program cause my machine to start thrashing". But even that doesn't
help if you have some pesky program that fires up a large number of
processes, like, say, Chrome. :-)

So it's not a per-process limit that we really want; instead, what we
want to do is put a program like Chrome into its own container group,
and then use memcg to constrain how much memory all of the processes
in that container group are allowed to use. And we can also use that
same abstraction to control how much scheduler and I/O bandwidth
programs in that container are allowed to use as well.

> I think that would be a useful feature to implement regardless of how we
> feel about ulimit -v and Go. I think we can reasonably expect more and
> more programs to try to take advantage of large virtual address spaces.
> Let's have a way to use them while still having a way to keep them from
> thrashing.

I think we do have a way of doing that. The kernel side support for
that is there, and a number of companies are using that to keep
programs from using too much physical memory. What's missing is the
userspace tools to make the right thing happen automatically when you
double click on an "Open Office" or "Chrome" icon on your desktop.

- Ted

Florian Weimer

Feb 12, 2011, 9:50:25 AM

to Alan Cox, martin capitanio, linux-...@vger.kernel.org, torv...@linux-foundation.org

* Alan Cox:

> Linux implements virtual address space limits, and enforces them. The go
> language stuff wants to allocate huge amounts of virtual space so you
> need to tell the OS you want to allow it to do crazy stuff, which you can
> do. But virtual address space is not free - it has to be tracked and
> if the application suddenly tries to fill all of it what will happen?
>
> You'll hit problems if the kernel is running with vm overcommit disabled
> (as well configured servers do).

The odd thing is that prot==0 does *not* count against the
vm.overcommit_memory=2 limit, only against ulimit -v. The limit is
only enforced for the parts on which mprotect is called. I think this
should really be part of the public API (I'm not sure if it is right
now, it could well be an accident), to avoid the problems you
describe.
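
For reference, the reserve-then-mprotect pattern under discussion can be
sketched as follows. This is only an illustration of the calls involved,
assuming Linux and a libc reachable through ctypes.CDLL(None); the helper
names are hypothetical, and the snippet does not demonstrate or verify how
the kernel accounts for either step.

```python
import ctypes
import mmap
import os

_libc = ctypes.CDLL(None, use_errno=True)
_libc.mmap.restype = ctypes.c_void_p
_libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                       ctypes.c_int, ctypes.c_int, ctypes.c_long]
_libc.mprotect.restype = ctypes.c_int
_libc.mprotect.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
_libc.munmap.restype = ctypes.c_int
_libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

PROT_NONE = 0                          # not exported by Python's mmap module
MAP_FAILED = ctypes.c_void_p(-1).value


def _raise_errno():
    err = ctypes.get_errno()
    raise OSError(err, os.strerror(err))


def reserve(length):
    """Reserve address space only: anonymous private mapping, prot == 0."""
    addr = _libc.mmap(None, length, PROT_NONE,
                      mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS, -1, 0)
    if addr == MAP_FAILED:
        _raise_errno()
    return addr


def commit(addr, length):
    """Make part of a reservation readable and writable via mprotect()."""
    if _libc.mprotect(addr, length, mmap.PROT_READ | mmap.PROT_WRITE) != 0:
        _raise_errno()


def release(addr, length):
    """Unmap a reservation made with reserve()."""
    if _libc.munmap(addr, length) != 0:
        _raise_errno()


if __name__ == "__main__":
    size = 1 << 20                     # small on purpose; the thread is about GBs
    base = reserve(size)
    commit(base, mmap.PAGESIZE)        # only the first page becomes usable
    ctypes.memset(base, 0x5A, mmap.PAGESIZE)
    print(ctypes.string_at(base, 4))
    release(base, size)
```

Per the message above, the whole reserve() size counts against ulimit -v,
while the overcommit limit is said to apply only to the range passed to
commit().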

Ted Ts'o

Feb 12, 2011, 8:22:49 PM

to Florian Weimer, Alan Cox, martin capitanio, linux-...@vger.kernel.org, torv...@linux-foundation.org

On Sat, Feb 12, 2011 at 03:28:37PM +0100, Florian Weimer wrote:
> * Alan Cox:
>
> > Linux implements virtual address space limits, and enforces them. The go
> > language stuff wants to allocate huge amounts of virtual space so you
> > need to tell the OS you want to allow it to do crazy stuff, which you can
> > do. But virtual address space is not free - it has to be tracked and
> > if the application suddenly tries to fill all of it what will happen?
> >
> > You'll hit problems if the kernel is running with vm overcommit disabled
> > (as well configured servers do).
>
> The odd thing is that prot==0 does *not* count against the
> vm.overcommit_memory=2 limit, only against ulimit -v. The limit is
> only enforced for the parts on which mprotect is called. I think this
> should really be part of the public API (I'm not sure if it is right
> now, it could well be an accident), to avoid the problems you
> describe.

The overcommit_memory logic does not include any pages which are
mapped read-only. Technically that's not quite enough --- in theory
you could have a debugger attach to every single read-only text page
and set breakpoints on every single page. Digital's OSF/1 operating
system went to such lengths, which meant that if you were running
(say) an FTP server where you might have hundreds of connections at
the same time, you would need to have enough swap space for every
single ftpd's text page as if they had been modified --- even though
in practice that never happened.

So it's not just prot==0 pages which are not counted; read-only pages
are not counted, either. This probably falls in the category of
"implementation detail", though. If and when we start having
instances where huge numbers of breakpoints or userspace probes get
set (say, if Systemtap actually gets wide use and the userspace probes
patch actually makes it into mainline), we might have to change the
details of how we deal with the accounting. I'm not sure it's worth
it to specify in great detail how things are done at this point, since
in the future it's possible that we might want to change them.

- Ted

Hannes Frederic Sowa

Feb 16, 2011, 1:16:31 PM

to Ted Ts'o, Martin Capitanio, Linus Torvalds, linux-...@vger.kernel.org, golang-dev, Russ Cox, Alan Cox, Albert Strasheim

On Wed, Feb 9, 2011 at 8:17 PM, Ted Ts'o <ty...@mit.edu> wrote:
> AIX and the BSD's don't implement RLIMIT_AS at all. Solaris does, but
> the man page just says "total available memory", again without
> specifying what that means. Solaris also has a RLIMIT_VMEM, which is
> the total amount of virtual address space, so apparently Solaris seems
> to think that RLIMIT_VMEM and RLIMIT_AS are different things.

Actually, no:
| ./uts/common/sys/resource.h:#define RLIMIT_AS RLIMIT_VMEM

They have a userland daemon called rcapd which enforces rss-limits on
process-groups by paging out their data.

Florian Weimer

Feb 16, 2011, 3:51:48 PM

to Ted Ts'o, Alan Cox, martin capitanio, linux-...@vger.kernel.org, torv...@linux-foundation.org

* Ted Ts'o:

>> The odd thing is that prot==0 does *not* count against the
>> vm.overcommit_memory=2 limit, only against ulimit -v. The limit is
>> only enforced for the parts on which mprotect is called. I think this
>> should really be part of the public API (I'm not sure if it is right
>> now, it could well be an accident), to avoid the problems you
>> describe.
>
> The overcommit_memory logic does not include any pages which are
> mapped read-only.

A colleague tells me that according to his tests, this depends on the
history of the page, as expected.

> Technically that's not quite enough --- in theory you could have a
> debugger attach to every single read-only text page and set
> breakpoints on every single page.

Those cases do not matter because setting a breakpoint can fail with
ENOMEM. You only have to take into account possible future operations
which cannot fail with ENOMEM.